

Audience: Diverse Background


Time: 1 day workshop (6 hours)


Pre-Requisites: Completion of NLP Introductory course developed by Data Science campus team and fulfillment of all prerequisites stated there.


Brief Description: This course will focus on key topics in NLP that are driven by machine learning. It seeks to illustrate how text can be used for predictive modelling in key problem domains.


Aims, Objectives and Intended Learning Outcomes: On completing this course, participants should have a rigorous understanding of the pipeline used to clean and preprocess natural language, and of the steps involved in preparing the data for machine learning algorithms. The course also shows how advanced statistical techniques such as topic modelling can be used to unearth topics in huge corpora.


Dataset: BBC News headline Dataset, Airline tweets sentiments Dataset, IATI (descriptions on aid activity) dataset


Libraries: Before attending the course please make sure that you read the course instructions that you received.


Acknowledgements: Many thanks to Paraskevi Pericleous, Isabela Breton and Dan Lewis for reviewing the material. Many thanks to the Data Science Campus team based at AH, East Kilbride for reviewing course and commentary. Also thanks to everyone who attended the pilot course to provide feedback about the course.




Chapter 1: Topic Modelling


Intended Learning Outcomes

1.1 What is Topic Modelling


 
 


 

Topic Modelling is a method for discovering topics in a document collection. A topic is a collection of dominant keywords; by looking at the keywords, you can identify what the topic is about.

There are many algorithms for Topic Modelling, such as Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA, a Bayesian version of pLSA) and lda2vec (an extension of word2vec and LDA).

In this course we will only focus on Latent Dirichlet Allocation (LDA) from the gensim library, which seems to be the most popular one.

When using LDA, each document in a set of documents is considered to be a mixture of various topics, and LDA assigns these topics to the document. The assumption is that the topic distribution has a sparse Dirichlet prior, which encodes the intuition that documents cover a small set of topics and topics use a small set of frequent words. Topics are identified based on the likelihood of term co-occurrence.

1.2 Building a simple LDA Model

The quality of the model depends on the text pre-processing, the choice of the algorithm, the number of topics we choose, the variety of the text topics and the parameter tuning.

Steps:

  1. Create a dictionary using corpora.Dictionary

  2. Create the corpus and get the frequencies of the words using doc2bow

  3. Build and Train the LDA model using gensim.models.ldamodel.LdaModel

  4. View and Interpret Topics using print_topics

  5. Compute model perplexity using log_perplexity. Perplexity is a measure of how well the model can predict the observed values in our sample. We can estimate perplexity for different LDA models; the model with the lowest perplexity is considered to predict the observed values best.

  6. Compute model coherence using CoherenceModel. The coherence score is a measure of how easily the topics can be interpreted by humans (that is, it assesses the quality of the learned topics). A good model will provide coherent topics: the higher the coherence, the better the extracted topics.

  7. Visualize topics and keywords using pyLDAvis.gensim.prepare

  8. Find the optimal number of topics by building many LDA models with different values of number of topics and pick the one that gives the highest coherence value.

  9. Find the dominant topic in document

  10. Find the most representative document for each topic

  11. Find the topic distribution across documents
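
Steps 9 to 11 work on the per-document topic distributions that a trained model returns. The sketch below is plain Python and assumes those distributions are already available as lists of (topic_id, probability) pairs; the numbers and the names `doc_topics` and `dominant_topic` are made up for illustration and do not come from a trained model.

```python
from collections import Counter

# Hypothetical per-document topic distributions: (topic_id, probability).
doc_topics = [
    [(0, 0.80), (1, 0.15), (2, 0.05)],
    [(0, 0.10), (1, 0.70), (2, 0.20)],
    [(0, 0.60), (1, 0.30), (2, 0.10)],
    [(0, 0.05), (1, 0.05), (2, 0.90)],
]

def dominant_topic(distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])

# Step 9: dominant topic of every document.
dominant = [dominant_topic(d) for d in doc_topics]

# Step 10: for each topic, the document in which it has the highest probability.
representative = {}
for doc_id, distribution in enumerate(doc_topics):
    for topic_id, prob in distribution:
        if topic_id not in representative or prob > representative[topic_id][1]:
            representative[topic_id] = (doc_id, prob)

# Step 11: how many documents have each topic as their dominant topic.
topic_counts = Counter(topic_id for topic_id, _ in dominant)

print(dominant)
print(representative)
print(topic_counts)
```

With a real model the same three loops would run over the distributions returned for each bag-of-words document instead of the hard-coded list.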


Chapter 2: Information Retrieval


Intended Learning Outcomes: By the end of Chapter 2 you should be able to:-

  • Define key terms in Information Retrieval (IR).

  • List at a high level of abstraction key steps in developing an IR application.

  • Describe how IR can be challenging

  • Describe an Inverted Index

  • Set up an inverted index for a document collection in Python using scikit-learn

  • Define 3 models used to build an IR application

  • Describe the Boolean Retrieval Model

  • Set up a Boolean Retrieval search over a document collection

  • Describe VSM approach to IR

  • Set up a VSM based IR program over a document collection

  • Describe Language Modelling approach to IR

  • Calculate maximum likelihood estimates for terms in a document collection.

  • Apply Linear Interpolation to query/document to determine a probability score for query/document.


2.1 What is Information Retrieval (IR)?


The meaning of the term Information Retrieval can be quite broad.

Any time you look up information to get a task done could be considered IR.

A useful definition given by Manning (2009):

IR is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).


Key Terms used in Information Retrieval


An information need is the topic about which the user desires to know more.

A query is what the user conveys to the computer in an attempt to communicate the information need.

A document is an information entity the user wants to retrieve.

A document is relevant if the user perceives that it contains information of value with respect to their personal information need.

A collection is a set of documents.

A term is a word or concept that appears in a document or query

An index is a representation of information that makes querying easier



Information Retrieval vs Web Search

IR is more than web search

IR is concerned with the finding of (any kind of) relevant information

Up until a few decades ago, people preferred to get information from other people, e.g. booking travel via a human travel agent, asking librarians to search for books, or relying on paralegals. Information retrieval used to be an activity only a few people engaged in.

The world has changed: hundreds of millions of people engage in information retrieval (IR) every day through web search. However, many other cases of IR, e.g. email search, searching your laptop and interrogating corporate knowledge bases, are also commonplace examples of search.

Information retrieval has overtaken database retrieval as most information does not reside in database systems.



 

2.2 The Mechanics of Information Retrieval


 


 

2.3 The Central problem in IR


 


 


 

Related to the above are the following issues:

  1. Document and query indexing
    How to best represent their contents.
  2. Query evaluation (or retrieval process)
    To what extent does a document correspond to a query?
  3. System evaluation
    How good is a system? Are the retrieved documents relevant (precision)? Are all relevant documents retrieved (recall)?
  4. Relevant documents need to be found very quickly from vast quantities of data (hundreds of billions of pages in some cases).
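
Precision and recall can be computed directly once the retrieved and the relevant documents are known. The following is a minimal sketch; the document ids and the helper name `precision_recall` are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of retrieved and relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # precision: share of retrieved documents that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    # recall: share of relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: the system returns 4 documents, 3 of which are relevant,
# while the collection holds 6 relevant documents in total.
retrieved_docs = ["d1", "d2", "d3", "d7"]
relevant_docs = ["d1", "d2", "d3", "d4", "d5", "d6"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(precision, recall)
```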


 

Questions to tackle in retrieval

  • How is a document represented with the selected keywords ?
  • How are document and query representations compared to calculate a score ?


 

The task in information retrieval is this: we have vast amounts of information to which accurate and speedy access is becoming ever more difficult. One effect of this is that relevant information gets ignored since it is never uncovered, which in turn leads to much duplication of work and effort. With the advent of computers, a great deal of thought has been given to using them to provide rapid and intelligent retrieval systems. The idea of relevance has slipped into the discussion. It is this notion which is at the centre of information retrieval. The purpose of an automatic retrieval strategy is to retrieve all the relevant documents at the same time retrieving as few of the non-relevant as possible. An IR system should generate a ranking which reflects relevance.

Most search engines use a bag-of-words representation to build retrieval models: the document is treated as a bag of words, ignoring word order.
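
As a small illustration of the bag-of-words idea, the sketch below counts tokens with `collections.Counter`. The helper name `bag_of_words` and the whitespace tokenisation are simplifying assumptions.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split on whitespace and count each token."""
    return Counter(text.lower().split())

bag_a = bag_of_words("the ink he likes to drink is pink")
bag_b = bag_of_words("pink is the ink he likes to drink")

# Word order is discarded, so both sentences give the same bag.
print(bag_a == bag_b)
print(bag_a["ink"])
```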


 

2.4 Document Representation: The Inverted Index


 

Basic Concept: Each document is described by a set of representative keywords called index terms.
 

Assign a numerical weight to index terms


 


 

   


   


   


   


   


   

The above index is often represented as a dictionary file of terms with an associated postings file.
 

This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
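
To make the dictionary-plus-postings idea concrete, here is a minimal plain-Python sketch. It assumes a simple regular-expression tokeniser and no stop word removal; the function name `build_inverted_index` is illustrative.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the sorted list of document ids (postings) containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in re.findall(r"\w+", text.lower()):
            postings[term].add(doc_id)
    # The sorted keys play the role of the dictionary file,
    # the sorted id lists the role of the postings file.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = [
    "He likes to wink, He likes to drink!",
    "He likes to drink, and drink, and drink.",
    "The thing he likes to drink is ink",
]
index = build_inverted_index(docs)
print(index["drink"])
print(index["ink"])
print(index["wink"])
```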


 

Diving into Code


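import re

import pandas as pd
from nltk.corpus import stopwords  # needs the NLTK "stopwords" corpus to be downloaded
from sklearn.feature_extraction.text import CountVectorizer

inverted_index_example = ["He likes to wink, He likes to drink!", "He likes to drink, and drink, and drink.", "The thing he likes to drink is ink","The ink he likes to drink is pink","He likes to wink, and drink pink ink" ] 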


def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data


def remove_punctuation(data):
    # Replace every character that is not a letter, digit or underscore with a space
    for index, entry in enumerate(data):
        data[index] = re.sub(r'[^\w]', ' ', entry)
    return data

def remove_stopwords_from_tokens(data):
       stop_words = set(stopwords.words("english"))
       new_list = []
       for index, entry in enumerate(data):
           no_stopwords = ""
           entry = entry.split()
           for word in entry:
               if word not in stop_words:
                    no_stopwords = no_stopwords + " " + word 
           new_list.append(no_stopwords)
       return new_list


inverted_index_example = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(inverted_index_example)))

vectorizer = CountVectorizer()
inverted_index_vectorised = vectorizer.fit_transform(inverted_index_example)

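# Optional: view the term-document matrix
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
tdm = pd.DataFrame(inverted_index_vectorised.toarray(), columns = vectorizer.get_feature_names_out())
print (tdm.transpose())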
##        0  1  2  3  4
## drink  1  3  1  1  1
## ink    0  0  1  1  1
## likes  2  1  1  1  1
## pink   0  0  0  1  1
## thing  0  0  1  0  0
## wink   1  0  0  0  1


   

The following can be said about the inverted index:-
  • It maps terms to the documents that contain them. It “inverts” the collection (which maps documents to the words they contain)
  • It permits us to answer boolean queries without visiting the entire corpus
  • It is slow to construct (requires visiting the entire corpus), but this only needs to be done once
  • It can be used for any number of queries
  • It can be built before any queries have been seen
 


   

Exercise:
 

  1. Set up an inverted index using Python that would be built for the following document collection. Basic preprocessing should also be undertaken on the data.

  Doc 1: New home sales top forecasts.
  Doc 2: Home sales rise, in July!
  Doc 3: Increase in home sales, in July.
  Doc 4: July new home sales rise.
 

  2. Write a Python function that takes 2 words (e.g. “home” and “sales”) and returns the document(s) that contain both words.

 

Optional Extra:

 

Find documents matching query “pink ink”

 

  1. Find documents containing both words

  2. Both words have to appear together as a phrase

 

We could have a bi-gram index

 


 

Bi-gram index issues:
  Fast, but index size will explode
  What about trigram phrases?
  What about proximity? “ink is pink”
 

A possible solution: Proximity Index

 

Term positions are embedded in the inverted index

 

This is called a proximity/positional index
  Enables phrase and proximity search
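
A minimal sketch of a positional index and a two-word phrase search follows. It is plain Python with an assumed regular-expression tokeniser and no stop word removal; `build_positional_index` and `phrase_search` are illustrative names, and the data is two sentences from the earlier example rather than the exercise data.

```python
import re
from collections import defaultdict

def build_positional_index(documents):
    """Map term -> {doc_id: [positions]}; positions count tokens from 0."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_search(index, first, second):
    """Return ids of documents where `second` occurs directly after `first`."""
    matches = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            matches.append(doc_id)
    return sorted(matches)

docs = [
    "The ink he likes to drink is pink",
    "He likes to wink, and drink pink ink",
]
pos_index = build_positional_index(docs)
print(pos_index["ink"])                         # positions of "ink" per document
print(len(pos_index["ink"]))                    # document frequency of "ink"
print(phrase_search(pos_index, "pink", "ink"))  # documents containing the phrase
```

A proximity search could reuse the same structure by testing whether two positions differ by at most k instead of exactly 1.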


   


   

Implement a positional inverted index on the data shown below.
  You need to save the following information in the terms' inverted lists:
  - term (pre-processed) and its document frequency
  - list of documents where this term occurred
  - for each document, list of positions where the term occurred within the document
 

Doc 1: breakthrough drug for schizophrenia
  Doc 2: new schizophrenia drug
  Doc 3: new approach for treatment of schizophrenia
  Doc 4: new hopes for schizophrenia patients


 

2.5 Taxonomy of Classical IR Models


 

For effectively retrieving relevant documents by IR strategies, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes.

A retrieval model specifies the details of:
  • Document representation
  • Query representation
  • Retrieval function: how to find relevant results
  • Determines a notion of relevance
 

In classical IR models a document is described as a set of representative keywords, called index terms. Each term is assigned a numerical weight to determine relevance.


   


 

2.6 Classical IR Models: Boolean Retrieval


 

The simplest form of document retrieval is for a computer to do a linear scan through the documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. However, searching through large collections (billions to trillions of words) in this way is unacceptably slow. More flexible matching operations require ranked retrieval.

One alternative to linearly scanning is to index the documents in advance.

Suppose we record for each document (here a play of Shakespeare’s) whether it contains each word out of all the words Shakespeare used (about 32,000 different words). The result is a binary term-document incidence matrix, as in the figure. Terms that are indexed are usually words.


   


   

We can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it. To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100.
  The answers for this query are thus Antony and Cleopatra and Hamlet.
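
The bitwise calculation can be reproduced with Python integers. This is only a sketch: the column order of the plays is assumed to follow the incidence matrix in Manning et al., and the Calpurnia vector 010000 is inferred as the complement of the 101111 used above.

```python
# Column order assumed from the incidence matrix in Manning et al.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit per play, leftmost bit = first play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

mask = (1 << len(plays)) - 1          # 0b111111, keeps the complement to 6 bits
result = brutus & caesar & (~calpurnia & mask)

bits = format(result, "06b")
answers = [play for play, bit in zip(plays, bits) if bit == "1"]
print(bits)
print(answers)
```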


 


 

Question

  1. Construct the term-document incidence matrix for documents in the Python list below.
  2. What are the returned results for query

    1. drink AND ink AND NOT pink


 

Diving into Code



data = ["He likes to wink, He likes to drink!", "He likes to drink, and drink, and drink.", "The thing he likes to drink is ink","The ink he likes to drink is pink","He likes to wink, and drink pink ink" ] 

data = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(data)))

binary_vectorizer = CountVectorizer(binary=True)
counts = binary_vectorizer.fit_transform(data)

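# Optional: view the incidence matrix
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
tdm = pd.DataFrame(counts.toarray(), columns = binary_vectorizer.get_feature_names_out())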
tdm=tdm.transpose()
print (tdm)
##        0  1  2  3  4
## drink  1  1  1  1  1
## ink    0  0  1  1  1
## likes  1  1  1  1  1
## pink   0  0  0  1  1
## thing  0  0  1  0  0
## wink   1  0  0  0  1
def NOT(pterm): 
    for a in range(len(pterm)):
        if(pterm[a] == 0): 
            pterm[a] = 1
        elif(pterm[a] == 1): 
           pterm[a] = 0
    return pterm


term1 =  list(tdm.loc['drink'])
term2 = list(tdm.loc['ink'])
term3 =  NOT(list(tdm.loc['pink']))
terms = list(zip(term1, term2, term3))

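# AND the three term vectors position by position
vector = [a & b & c for a, b, c in terms]

# enumerate() gives each document its own index; vector.index(i) would always return the first match
for doc_id, match in enumerate(vector):
    if match == 1:
        print ("Document", doc_id, "meets search term criteria")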
## Document 2 meets search term criteria


 

The Boolean retrieval model is a model for information retrieval in which we can pose any query that takes the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.


The following can be said of the Boolean Retrieval Model:-
  • It can answer any query that is made up of boolean expressions
  • Boolean queries are queries that use AND, OR and NOT to join query terms
  • Views each document as a set of terms
  • It is precise - document matches conditions or not
  • Primary commercial retrieval tool for 3 decades
  • Many professional searchers (e.g., lawyers) still like boolean queries
  • You know exactly what you are getting
  • It does not have a built-in way of ranking matched documents by some notion of relevance
  • It is easy to understand. Clean formalism
  • It is too complex for web users
  • Incidence matrix is impractical for big collections
 

Exercise:


Consider these documents:
  Doc 1: breakthrough drug for schizophrenia
  Doc 2: new schizophrenia drug
  Doc 3: new approach for treatment of schizophrenia
  Doc 4: new hopes for schizophrenia patients

  1. Draw the term-document incidence matrix for this document collection
     
  2. What are the returned results for query
      schizophrenia AND drug


2.7 Classical IR Models: Vector Space Model

 

The representation of a set of documents as vectors in a common vector space is known as the Vector Space Model . Every distinct word has one dimension.


Key idea: Documents and queries are vectors in a high-dimensional space.
 

Key issues:
 

  • What to select as the dimensions of this space?
  • How to convert documents and queries into vectors?
  • How to compare queries with documents in this space?
 

The Vector Space Model assumes that
 

  • the degree of matching can be used to rank-order documents;
  • this rank-ordering corresponds to how well a document satisfies a user’s information need.
 

Steps in Vector Space Modelling

  • Convert the query to a vector of terms
  • Weight each component.
  • Consult the index to find all documents containing each term
  • Convert each document to a weighted vector
  • Query and documents mapped to vectors and their angles compared
  • Match the query vector against each document vector and sort the documents by their similarity
  • Similarity based on occurrence frequencies of keywords in query and document
  • Output documents are ranked according to similarity to query
 

Challenges
 

  • Finding a good set of basis vectors.
  • Finding a good weighting scheme for terms, since the model provides no guidance. Usually variations on (length-normalised) tf*idf.
  • Finding a comparison function, since again the model provides no guidance. Usually cosine comparison.

Comments on Vector Space Models
  • Simple, practical, and mathematically based approach
  • Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
 

Overall, Vector Space Models are hard to beat
 

Consider the documents and query term below
 

Document 1: Cat runs behind rat
  Document 2: Dog runs behind cat
  Query: rat
 

A term-document matrix would be set up. This is a way of representing document vectors in a matrix format in which each row represents a term vector across all the documents and each column represents a document vector across all the terms.
 

Term weights are calculated for all the terms in the matrix across all the documents.
 

A word that occurs in most of the documents contributes little to document relevance, whereas less frequently occurring terms may define it. This is captured by a weighting known as term frequency - inverse document frequency (tf-idf), which gives higher weights to terms that occur often in a document but rarely in the other documents, and lower weights to terms that occur commonly within and across all the documents: tf-idf = tf × idf.
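
The weighting can be worked through on the two example documents. The sketch below assumes tf is the raw term count and idf = log(N / df), one of several common variants; library implementations often use a smoothed and normalised form, so their numbers will differ.

```python
import math
from collections import Counter

docs = {
    "Document 1": "cat runs behind rat",
    "Document 2": "dog runs behind cat",
}

tokenised = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenised)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenised.values() for term in set(tokens))

def tf_idf(tokens):
    """Return {term: tf * idf} with tf = raw count and idf = log(N / df)."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}

weights = {name: tf_idf(tokens) for name, tokens in tokenised.items()}
for name, w in weights.items():
    print(name, w)
```

Terms that appear in both documents (cat, runs, behind) get a weight of zero under this variant, while rat and dog, which each appear in only one document, get a positive weight.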
 

Similarity Measures: cosine similarity
 

Mathematically, the closeness between two vectors is measured by the cosine of the angle between them. To find documents relevant to the query, the cosine similarity between each document vector and the query vector is calculated. Documents with a high similarity score are considered relevant to the query.
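
A plain-Python sketch of the cosine calculation for the cat/dog example is shown below. For simplicity it assumes raw term counts as vector components instead of tf-idf weights; the function name `cosine` is illustrative.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Term order: behind, cat, dog, rat, runs (raw term counts).
doc1 = [1, 1, 0, 1, 1]    # "Cat runs behind rat"
doc2 = [1, 1, 1, 0, 1]    # "Dog runs behind cat"
query = [0, 0, 0, 1, 0]   # "rat"

scores = {"Document 1": cosine(query, doc1),
          "Document 2": cosine(query, doc2)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```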
 


 


 


 

Summary on VSM
     

 

Diving into Code

 

The IATI dataset will be used; further details on this dataset can be found at https://iatistandard.org/en/iati-standard/. The dataset used below is a subset that provides descriptions of aid activity undertaken by various organisations in the aid sector around the world.



import operator
import pandas as pd
import re
import sklearn
from sklearn.decomposition import PCA
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import ngrams
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk.collocations import *
from nltk.corpus import webtext
import numpy as np
import random
import pickle
from sklearn.metrics.pairwise import cosine_similarity


def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data


def remove_punctuation(data):
    # Replace every character that is not a letter, digit or underscore with a space
    for index, entry in enumerate(data):
        data[index] = re.sub(r'[^\w]', ' ', entry)
    return data

def remove_stopwords_from_tokens(data):
       stop_words = set(stopwords.words("english"))
       new_list = []
       for index, entry in enumerate(data):
           no_stopwords = ""
           entry = entry.split()
           for word in entry:
               if word not in stop_words:
                    no_stopwords = no_stopwords + " " + word 
           new_list.append(no_stopwords)
       return new_list
   
    
def stemming (data):
    st = PorterStemmer()
    for index, entry in enumerate(data):
        data[index] = st.stem(entry)
    return data
  
def read_data():
    # Load the pickled IATI subset and keep only rows that have a description
    with open("C:/IR Course/Adv -IR/IATI3.pkl","rb") as handle:
        raw_data_orig = pickle.load(handle, encoding='iso-8859-1')
    raw_data_orig = raw_data_orig[raw_data_orig['description'].notnull()]
    return raw_data_orig


query ="climate change and environmental degradation"

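def preprocess(pdf):
    # Work on a copy and assign the column directly: writing to the rows yielded
    # by iterrows() is not guaranteed to update the DataFrame
    pdf = pdf.copy()
    pdf['description'] = pdf['description'].apply(
        lambda text: " ".join(stemming(remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(text.split(" ")))))))
    return pdf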

#preprocess documents
raw_data= preprocess(read_data())


#now preprocess query
query = " ".join(stemming(remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(query.split(" "))))))
rownames = raw_data["iati-identifier"]


#vectorise and get tfidf values
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
vectorizer = TfidfVectorizer()
vectorized_iati = vectorizer.fit_transform(raw_data["description"])
tdm = pd.DataFrame(vectorized_iati.toarray(), columns = vectorizer.get_feature_names_out())
tdm=tdm.set_index(rownames)

#now vectorise query
vectorized_query=vectorizer.transform(pd.Series(query))
query = pd.DataFrame(vectorized_query.toarray(), columns = vectorizer.get_feature_names_out())

# get cosine similarity

def cos_sim (pdf, qdf):
    # Cosine similarity between the (single) query row and every document row
    f_similarity={}   
    for index, row in qdf.iterrows():
        for index2, row2 in pdf.iterrows():
             cos_result = cosine_similarity(np.array(row).reshape(1, row.shape[0]), np.array(row2).reshape(1, row2.shape[0]))
             # cos_result is a 1x1 array; index it before converting to float
             f_similarity[index2] = round(float(cos_result[0, 0]),5)
    return f_similarity

cosine_scores=cos_sim (tdm, query)
#now rank
final_rank= sorted(cosine_scores.items(), key=operator.itemgetter(1), reverse=True)
final_rank = final_rank[0:5]
rownames = rownames.tolist()
unprocessed  = read_data()

for item in final_rank:
    if item[0] in rownames:
         
        print('IATI-IDENTIFIER {0} DESCRIPTION {1}'.format(item[0],unprocessed.iloc[rownames.index(item[0]),2])) 
        
## IATI-IDENTIFIER 41AAA-11960-005 DESCRIPTION UNOPS supports the Global Environment Facility (GEF) Small Grants Programme that helps protect poor, remote villages from the serious effects of climate change and environmental degradation. In an effort to support community-led initiatives, UNOPS efficiently channels direct grants to help communities cope with climate change, conserve biodiversity, protect international waters, reduce the impact of persistent organic pollutants, prevent land degradation, and adopt sustainable forest management practices.
## IATI-IDENTIFIER 41120-8599 DESCRIPTION vOverall Objective:Residents of cities in developing countries and their urban systems begin to become more resilient to the impacts and climate change, and reduce their carbon emissions.We intend to achieve this objectivethrough the followingExpected Accomplishments (EAs):EA1:The urban dimension is introduced into climate change agreements, strategies, policies, laws and regulations, and the climate change dimension is introduced into urban strategies, policies, laws and regulationsEA2:Urban decision-makers and stakeholders have increased capacity to reduce greenhouse gas emissions and adapt to climate change, and the institutions that build their capacities have adapted their teaching curricula and research accordinglyEA3:Cities participating in CCCI develop and begin to implement pro-poor strategies to adapt to climate change and embrace low carbon growth trajectoriesEA4:A global network of partners begins to advocate for policies to help cities better address climate change, access climate finance, and manage knowledge.The underlying development hypothesis here that relates the EAs to our Overall Objective is that together policy change and capacity building will lead to changes in behaviour. For example, policy reform at either the national or the local level (EAs Nos. 1 and 3, respectively), coupled with adequate institutional and human capacity (addressed in EA 2), will lead to actual reduced emissions from cities (part of Overall Objective). For further discussion of our strategy for implementing individual expected accomplishments, please see Section 2 of the CCCI Consolidated Strategy (May 2013), developed for discussion with the CCCI External Advisory Committee (available upon request).CCCI has followed roughly this logical framework for the past several years. 
However in recent years sharp cuts in annual replenishments, coupled with the increased visibility of cities in international processes with related demands on UN-Habitat, have tended to reduce resources available for city-level work (EA 3). 
## IATI-IDENTIFIER 41119-GW-REGULAR-S12-UNFPA DESCRIPTION UNFPA Guinea-Bissau regular-funded Activities to strengthen national capacity for production and dissemination of quality disaggregated data on population and development issues that allows for mapping of demographic disparities and socio-economic inequalities, and for programming in humanitarian settings activities implemented by UNFPA
## IATI-IDENTIFIER 41119-SS-OTHER-S10-NGO DESCRIPTION UNFPA South Sudan other-funded Activities to increase capacity to prevent gender-based violence and harmful practices and enable the delivery of multisectoral services, including in humanitarian settings activities implemented by NGO
## IATI-IDENTIFIER 41AAA-21461-001 DESCRIPTION Mejoramiento de la infraestructura de dos escuelas agrícolas


 

Exercise:

 

From the IATI10k.csv file, extract a sample of records (for example 100 rows) then do the following:
  1. Put the description column through pre-processing. Make a decision on what preprocessing routines would be suitable.
  2. Set up a suitable query to interrogate the document collection.
  3. Construct the inverted index with tf-idf scores. Ensure that the query has also been converted to a vector with tf-idf scores.
  4. Compare the query vector with all other vectors in the document collection and calculate cosine similarity. Then store the iati-identifier field as a key with the cosine score as a value in a Python dictionary.
  5. Rank the dictionary by cosine scores (value field in the dictionary) and print the top 10 scores (sort in ascending value).
 

2.8 Classical IR Models: Probability based Information Retrieval

   


 

Use probability to determine relevance. How well does a document satisfy the query ?

An IR system has an uncertain understanding of the user query and makes an uncertain guess of whether a document satisfies the query.

Probability theory provides a principled foundation for such reasoning under uncertainty.

The query and the documents are all observations from random variables. In the vector-based models we assume they are vectors, but here we assume they are data observed from random variables.

And so, the problem of retrieval becomes one of estimating the probability of relevance.

In this category of models, there are different variants.
   

   

Classical probabilistic retrieval models


  Binary Independence Model
  Okapi BM25
  Bayesian networks for text retrieval
  Language model approach to IR
 

Probability Ranking Principle
 


 

Language Modelling Approach to Retrieval - Query Likelihood Retrieval Model


 

In query likelihood, our assumption is that this probability of relevance can be approximated by the probability of query given a document and relevance.
 

How do we compute this conditional probability?
 

This is where we build a Language Model.
 

What is a language model ?
 

“The goal of a language model is to assign a probability to a sequence of words by means of a probability distribution” –Wikipedia

To understand what a language model is, we must know what the following are:
 

  • probability distribution
  • discrete random variable
 

In a unigram language model we estimate (and predict) the likelihood of each word independent of any other word
 

Defines a probability distribution over individual words


 

Sequences of words can be assigned a probability by multiplying their individual probabilities:
 

P(university of north carolina) = P(university) x P(of) x P(north) x P(carolina) = (2/20) x (4/20) x (2/20) x (1/20) = 0.0001


There are two important steps in language modeling
  ‣ estimation: observing text and estimating the probability of each word
  ‣ prediction: using the language model to assign a probability to a span of text.
 

Unigram Language Model Estimation

 

General estimation approach:
 

  ‣ tokenize/split the text into terms
  ‣ count the total number of term occurrences (N)
  ‣ count the number of occurrences of each term t (tf_t)
  ‣ assign term t a probability equal to tf_t / N
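
The estimation and prediction steps can be sketched in a few lines of Python. This assumes whitespace tokenisation and the maximum likelihood estimate P(t) = tf_t / N; the training text is made up and the function names are illustrative.

```python
from collections import Counter

def estimate_unigram_model(text):
    """Maximum likelihood estimate: P(t) = tf_t / N."""
    tokens = text.lower().split()
    total = len(tokens)                      # N
    counts = Counter(tokens)                 # tf_t for every term
    return {term: count / total for term, count in counts.items()}

def sequence_probability(model, text):
    """Multiply the unigram probabilities of the tokens; unseen tokens give 0."""
    probability = 1.0
    for token in text.lower().split():
        probability *= model.get(token, 0.0)
    return probability

# Made-up training text with N = 10 tokens.
model = estimate_unigram_model("the cat sat on the mat the cat ate fish")
print(model["the"])
print(model["cat"])
print(sequence_probability(model, "the cat"))
print(sequence_probability(model, "the dog"))
```

Any sequence containing a term that was never observed receives probability zero under this estimate.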
 

Document Language Models

  • Suppose we have a document D, with language model θD
  • We can use this language model to determine the probability of a particular sequence of text
  • How? We multiply the probabilities associated with each term in the sequence!
 

Example:-
 


 

Question:
 

What is the probability given by this language model to the sequence of text “rocky is a boxer” or “a boxer is a pet”?
 

To summarise how is the document model estimated for each document?
 


 

Query-Likelihood Retrieval Model: Some Examples

 

  • Objective: rank documents based on the probability that they are on the same topic as the query
  • Solution:
  ‣ Score each document (denoted by D) according to the probability given by its language model to the query (denoted by Q)
  ‣ Rank documents in descending order of score
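
The scoring and ranking steps above can be sketched as follows. The two-document collection is made up, the document models are unsmoothed maximum likelihood estimates, and the function names are illustrative.

```python
from collections import Counter

def document_model(text):
    """Unsmoothed unigram model of one document: P(t|D) = tf_t,D / N_D."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def query_likelihood(query, model):
    """Score = product of P(t|D) over the query terms."""
    score = 1.0
    for term in query.lower().split():
        score *= model.get(term, 0.0)
    return score

# Made-up two-document collection.
collection = {
    "D1": "new home sales rise in july",
    "D2": "home prices rise as home sales fall",
}
models = {doc_id: document_model(text) for doc_id, text in collection.items()}
scores = {doc_id: query_likelihood("home sales", m) for doc_id, m in models.items()}

# Rank documents in descending order of score.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Here D2 ranks first because "home" makes up a larger share of its tokens (2 of 7) than of D1's (1 of 6).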
 


 

Every document in the collection is associated with a language model
  • Let θD denote the language model associated with document D
  • Think of a “black-box”: given a word, it outputs a probability
 

Let P(t|θD) denote the probability given by θD to term t
 


 

Question:
 

Which would be the top-ranked document and what would be its score?
 


 

P(q|M1) > P(q|M2)


 

Query-Likelihood Retrieval Model: Some Issues

 

There are (at least) two issues with scoring documents based on query terms
 

  • A document with a single missing query term will receive a score of zero (similar to boolean AND)
  • Where is IDF? No attempt is made to suppress the contribution of terms that are frequent in the document and also frequent in general (appear in many documents)
   

Query-Likelihood Retrieval Model: Add One Smoothing and Linear Interpolation

   

• The goal of smoothing is to …
 

‣ Decrease the probability of observed outcomes
  ‣ Increase the probability of unobserved outcomes
 

Add One Smoothing
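
A minimal sketch of add-one smoothing follows, assuming the common textbook formulation in which every vocabulary term's count is increased by one and the denominator grows by the vocabulary size. The exact form used on the course slide may differ, and the function name `add_one_model` and the toy vocabulary are invented for the example.

```python
from collections import Counter

def add_one_model(text, vocabulary):
    """Add-one smoothed model: (tf_t + 1) / (N + |V|) for every term in the vocabulary."""
    tokens = text.lower().split()
    counts = Counter(tokens)                   # unseen terms count as 0
    denominator = len(tokens) + len(vocabulary)
    return {term: (counts[term] + 1) / denominator for term in vocabulary}

# Toy vocabulary (assumed to cover every term in the document)
vocabulary = {"cheap", "flights", "to", "rome", "hotels"}
model = add_one_model("cheap flights to rome", vocabulary)
print(model["rome"], model["hotels"])
```

The observed term "rome" drops from 1/4 to 2/9, while the unobserved term "hotels" rises from 0 to 1/9, which is exactly the redistribution that smoothing aims for.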
 


 

A more effective approach to smoothing for information retrieval is called linear interpolation
 

  • Let θD denote the language model associated with document D
  • Let θC denote the language model associated with the entire collection
  • Using linear interpolation, the probability given by the document language model to term t is:
 


 

As before, a document’s score is given by the probability that it “generated” the query
 

• As before, this is given by multiplying the individual query-term probabilities
 

• However, the probabilities are obtained using the linearly interpolated language model
 

Without smoothing, the query-likelihood model ignores how frequently the term occurs in general!
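
The following sketch shows query-likelihood scoring with linear interpolation on two invented documents. It assumes the usual mixture form, lam * P(t|document) + (1 - lam) * P(t|collection); the weight name `lam`, its value of 0.5 and the helper names are assumptions for illustration.

```python
import math
from collections import Counter

def ml_model(tokens):
    """Relative-frequency model for a list of tokens."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def interpolated_score(query, doc_tokens, collection_model, lam=0.5):
    """Product over query terms of the interpolated document/collection probability."""
    doc_model = ml_model(doc_tokens)
    return math.prod(
        lam * doc_model.get(term, 0.0) + (1 - lam) * collection_model.get(term, 0.0)
        for term in query.lower().split()
    )

# Invented documents, for illustration only
docs = {
    "D1": "cheap flights to rome".split(),
    "D2": "rome hotels and rome tours".split(),
}
# The collection model is estimated from all documents pooled together
collection_model = ml_model([term for tokens in docs.values() for term in tokens])

scores = {
    name: interpolated_score("rome hotels", tokens, collection_model)
    for name, tokens in docs.items()
}
print(scores)
```

D1 does not contain "hotels", yet it still receives a small non-zero score because the collection model supplies some probability for the missing term.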
 


 


 


 

Diving into Code


import nltk
from nltk.corpus import stopwords
import pandas
import re
import numpy as np


   
df = pandas.read_csv('C:/IR Course/Adv -IR/IATI10k.csv', header = 0, encoding="iso-8859-1")
df = df[df.description.notnull()]

def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    
    return data

def remove_punctuation(data):
    # Replace every non-word character with a space, then trim the ends.
    # (Previously the stripped original token overwrote the substituted one.)
    for index, entry in enumerate(data):
        data[index] = re.sub(r'[^\w]', ' ', entry).strip()

    return data

def remove_stopwords_from_tokens(data):
       stop_words = set(stopwords.words("english"))
       stop_words.add(" ")
       new_list = []
       for index, entry in enumerate(data):
               if entry not in stop_words:
                    new_list.append(entry)
       return new_list

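def clean_df(pdf):
    # Lowercase, strip punctuation and remove stopwords from every description,
    # then join the remaining tokens back into one string.
    # (Assigning to the row inside iterrows() is not guaranteed to update the dataframe.)
    pdf['description'] = pdf['description'].apply(
        lambda text: " ".join(
            remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(text.split())))
        )
    )
    return pdf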


def calc_docscore(pdf, pqry):
    col_names =  ['Description', 'score']
    f_df2  = pandas.DataFrame(columns = col_names)
    for index, row in pdf.iterrows(): 
        rank = []
        docscore = 0
        scored = score(row['description'])
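        for word in pqry.split(" "):
            # Only query terms that occur in the document contribute; missing terms are skipped
            if word in scored.keys():
                # Document probability of the term plus half of its collection probability
                rank.append(float(scored[word]) + float(allcounts[word] / total) / 2)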
        
        if rank != []:
            docscore = np.prod(np.array(rank)) 
           
        f_df2.loc[index] = pandas.Series({'Description':row['description'], 'score':docscore})
    return f_df2
               
def score (pstr):
    fdict = {}
    flist = pstr.split()
    fdict = dict(nltk.FreqDist(flist))
    for  key, value in fdict.items():
        fdict[key] = round(fdict[key]/len(flist),2)
    return fdict

df = clean_df(df)
qry = "reduce transmission of HIV"
qry=  remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(qry.split())))
qry = " ".join(x for x in qry)      

allcounts = {} 
for descript in df['description']:
      tmp = dict(nltk.FreqDist(descript.split()))
      for key, value in tmp.items():
        if key not in allcounts:
            allcounts[key] = value
        else: 
            allcounts[key] = allcounts[key] + value
total = sum(allcounts.values())
df2=calc_docscore(df, qry) 
   
df2sort_by_score = df2.sort_values('score', ascending=False)
print (df2sort_by_score[1:20])
    
  
##                                             Description      score
## 136   goal project contribute reduction hiv incidenc...   0.121396
## 180   sa school-based sexuality hiv prevention educa...   0.121396
## 142   hiv prevention treatment professional sex work...   0.121396
## 143   hiv prevention treatment professional sex work...   0.121396
## 360   ?improving diabetes care prevention piloting h...   0.121396
## 167   reduce hiv/aids prevalence prevention identify...   0.120252
## 9537  unops helps procure retroviral drugs hiv progr...   0.101396
## 311   pace uganda sub mildmay hiv related project fu...   0.101396
## 211   objective program increase use evidence-based ...   0.101396
## 319   ?psi subreceipent project hope usaid award eth...   0.101396
## 281   sfh south africa/psi sub tb hiv care award hts...  0.0913958
## 65    psi kenya helping 3ie build evidence towards u...  0.0913958
## 38    overall goal reduce mortality morbidity due ma...  0.0902525
## 114   expand strengthen high quality community-based...  0.0813958
## 266   goal program scale-up strengthen delivery qual...  0.0813958
## 242   increased access universal hiv prevention serv...  0.0813958
## 316   using user-centered approaches increase adopti...  0.0813958
## 149   global fund project funded zimbabwe ministry h...  0.0802525
## 249   hiv identification treatment program lesotho -...  0.0713958


 

Exercises:
 

Suppose the document collection contains two documents:

\(d_1\): Xyzzy reports a profit but revenue is down
  \(d_2\): Quorus narrows quarter loss but revenue decreases further
 

The query is: “revenue down”
 

Calculate maximum likelihood estimates for terms in document 1 and document 2.
  Apply the linear interpolation and calculate the score for query/document 1 and query/document 2.
 

  1. Below are 5 mini documents, used previously.
     

D1: He likes to wink, he likes to drink
  D2: He likes to drink, and drink, and drink
  D3: The thing he likes to drink is ink
  D4: The ink he likes to drink is pink
  D5: He likes to wink, and drink pink ink
 

Query: “drink pink ink”
 

Write Python code to do the following:
 

  • lower case text, remove punctuation
     
  • apply linear interpolation to calculate score for document and query
     

3.0 Future of IR - Challenges Ahead

 

Data Volume
 

Amount of digital data in 2007: 281 exabytes = 281 trillion digitized novels
 

“Every 2 days now, we create as much information as we did from the dawn of civilisation up until 2003”
 

Eric Schmidt
 

Vocabulary mismatch problem due to synonymy and polysemy.
    The same word has different meanings.
  A search engine might not be able to guess the right meaning if appropriate contexts are not provided.
 

IR systems are only as good as the query provided to them.
    Queries are provided by humans, and the human is the weak link in this chain.
  So, a high-quality query is a must. With a very bad query, you can defeat any search engine.
 

Consider the search query: Windows


For this query, a search engine (such as Google) can show results of three types:

  1. The computer OS Windows
  2. Windows of buildings
  3. A combination of both (1) and (2)
   

A search engine should not aim to provide results of type 3, i.e. combined results for the Windows OS and the windows of buildings.
  A user who works for a building company might not want the Windows OS in the results for this query.
  The output should be building windows for this type of user.
 

Similarly, another user, working as a software engineer, should get Windows OS results for the same query.
 

Producing query results based on a person’s interests is called personalized search.
  Providing results based on a person’s interests, and ranking them accordingly, is one of the most challenging aspects of Information Retrieval (IR).


       

Chapter 3: Classification


 

Intended Learning Outcomes: By the end of Chapter 3 you will be able to:-

  • Define key terms in Information Retrieval (IR).


 

3.1 What is text classification?


 

Why do we need to classify texts?
 

As an independent task
 


  As a part of more complicated NLP tasks

 

  • Data Filtering
  • Intent Classification in dialog systems
  • Hybrid Machine translation systems


 

Text Classification in general
 


 


 

3.2 High-level overview of the workflow


 


 

The dataset used in this project is the BBC News Raw Dataset. It can be downloaded from here. It consists of 2,225 documents from the BBC news website corresponding to stories in five topical areas from 2004 to 2005. These areas are:
  • Business
  • Entertainment
  • Politics
  • Sport
  • Tech
     

3.2 Exploratory Data Analysis


  It is a common practice to carry out an exploratory data analysis in order to gain some insights from the data.
 

One of our main concerns when developing a classification model is whether the different classes are balanced, meaning that the dataset contains an approximately equal proportion of each class. For example, if we had two classes and 95% of observations belonged to one of them, a dumb classifier that always outputs the majority class would have 95% accuracy, although it would get every prediction for the minority class wrong. There are several ways of dealing with imbalanced datasets. One approach is to undersample the majority class and oversample the minority one, so as to obtain a more balanced dataset. Another is to use error metrics beyond accuracy, such as precision, recall or the F1-score. We’ll talk more about these metrics later. Looking at our data, we can get the % of observations belonging to each class:
 

Diving into Code

library(RColorBrewer)
df_final <- read.csv(file = "C:/data/BBC/News_dataset.csv", sep=';')

# Simple Horizontal Bar Plot with Added Labels
counts <- table(df_final$Category)
coul <- brewer.pal(5, "Set2") 
barplot(counts, main="Number of articles in each category",   names.arg=unique(df_final$Category), ylab="Number of Articles", col=coul)
...

num_of_articles <-as.vector(nchar(as.character(df_final$Content)))
df_final$Content <- as.vector(num_of_articles)
df_final <-df_final[df_final$Content < quantile(df_final$Content, 0.95), ]

boxplot(df_final$Content~Category,
        data=df_final,
        main="Different boxplots for each Category",
        xlab="Categories",
        ylab="Length of Article",
        col=coul,
        border="black"
)
...
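
The percentage of observations in each class can also be checked in Python. The sketch below uses only the standard library and a made-up list of labels; with the course dataframe, the pandas call `df['Category'].value_counts(normalize=True)` gives the same kind of summary.

```python
from collections import Counter

# Hypothetical labels standing in for the Category column
labels = ["business"] * 5 + ["sport"] * 4 + ["tech"] * 1

counts = Counter(labels)
# Share of each class as a percentage of all observations
shares = {category: 100 * n / len(labels) for category, n in counts.items()}
majority_share = max(shares.values())

for category, share in sorted(shares.items()):
    print(category, share)
```

The majority share is also the accuracy that a classifier always predicting the largest class would achieve, which makes it a useful baseline.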


  Link to Python Jupyter Notebook: https://github.com/miguelfzafra/Latest-News-Classifier/blob/master/0.%20Latest%20News%20Classifier/02.%20Exploratory%20Data%20Analysis/02.%20Exploratory%20Data%20Analysis.ipynb


 

3.3 Feature engineering


  As for many ML tasks, it is possible to generate useful features. For example:
 

  • General statistics: text length, text length variance
  • Scores from tagged lists

    • Sentiment dictionaries: SentiWordNet, SentiWords
    • Subjectivity/Objectivity dictionaries: MPQA


  • Syntactic features
  • POS tags
  • Ad-hoc features: e.g. number of emojis

Feature engineering is an essential part of building any intelligent system. As Andrew Ng says:
    “Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
    Feature engineering is the process of transforming data into features to act as inputs for machine learning models such that good quality features help in improving the model performance.
  When dealing with text data, there are several ways of obtaining features that represent the data. A few common methods are delineated below.
 

3.3.1 Text representation

 

In order to represent our text, every row of the dataset will be a single document of the corpus. The columns (features) will differ depending on which feature creation method we choose:
  Word Count Vectors
  With this method, every column is a term from the corpus, and every cell represents the frequency count of that term in that document.
  TF-IDF Vectors
  TF-IDF is a score that represents the relative importance of a term in the document and the entire corpus.
  See NLP Intro course for further explanation.

These two methods (Word Count Vectors and TF-IDF Vectors) are often called Bag of Words methods, since the order of the words in a sentence is ignored. The following methods are more advanced, as they preserve some information about word order and lexical context.
Word Embeddings
  The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.
  See NLP Intro course for further explanation.
  Text based or NLP based features
  We can manually create any feature that we think may be of importance when discerning between categories (e.g. word density, number of characters or words, etc.). We can also use NLP based features using Part of Speech models, which can tell us, for example, if a word is a noun or a verb, and then use the frequency distribution of the PoS tags.
  Topic Models
  Methods such as Latent Dirichlet Allocation try to represent every topic by a probabilistic distribution over words, in what is known as topic modelling.
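
To make the difference between word count vectors and TF-IDF vectors concrete, here is a small standard-library sketch on three invented documents. It assumes the plain textbook weighting tf × log(N / df), which is not identical to the smoothed variant that scikit-learn's TfidfVectorizer applies.

```python
import math
from collections import Counter

# Invented mini corpus, for illustration only
docs = ["profit rise profit", "match win", "profit fall"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})

# Word count vectors: one row per document, one column per vocabulary term
count_vectors = [[Counter(tokens)[term] for term in vocab] for tokens in tokenized]

# Document frequency: in how many documents does each term appear?
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocab}
n_docs = len(docs)

# TF-IDF vectors: term count scaled down for terms that occur in many documents
tfidf_vectors = [
    [Counter(tokens)[term] * math.log(n_docs / doc_freq[term]) for term in vocab]
    for tokens in tokenized
]

print(vocab)
print(count_vectors[0])
print(tfidf_vectors[0])
```

In the first document "profit" has the highest count, but "rise" receives the larger TF-IDF weight because it appears in only one document of the corpus.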
 

TF-IDF vectors have been chosen to represent the documents in our corpus because they are simple and quick to create.
 

When creating the features with this method, some parameters have to be chosen:
  N-gram range: unigrams, bigrams, trigrams?
  Maximum/Minimum Document Frequency: when building the vocabulary, ignore terms that have a document frequency strictly higher/lower than the given threshold.
  Maximum features: choose the top N features ordered by term frequency across the corpus.
 

In scikit-learn’s TfidfVectorizer these correspond to:
  ngram_range: we want to consider both unigrams and bigrams.
  max_df: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
  min_df: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
  max_features: if not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
  See TfidfVectorizer in scikit-learn for further detail.
 

The following parameters have been chosen: ngram_range = (1, 2), min_df = 10, max_df = 1.0 and max_features = 300.
 


   

3.4 Text cleaning

 

Before creating any feature from the raw text, we must perform a cleaning process to ensure no distortions are introduced to the model.
  See NLP Intro course for further explanation
 

3.5 Label coding

 

Machine learning models require numeric features and labels to provide a prediction. A dictionary to map each label to a numerical ID has been created. This mapping scheme is as below:
 


 

3.6 Train — test split

  A test set needs to be set up in order to assess the quality of the models when predicting unseen data.
A random split will be used, with 85% of the observations forming the training set and 15% forming the test set.
Hyperparameters will be tuned with cross-validation on the training data; the final model is then fitted to the training data and evaluated on totally unseen data, so as to obtain an evaluation metric that is as unbiased as possible.

 




import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
import random
import nltk
from nltk.stem import WordNetLemmatizer 
import re
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import os
os.chdir("C:/IR Course/Adv -IR/")
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import ShuffleSplit
plt.style.use('ggplot')

df_path = 'C:/data/BBC/'
df_path2 = df_path + 'News_dataset.csv'
df = pd.read_csv(df_path2, sep=';')
df['News_Length']= df['Content'].apply(len)
df.head()
##   File_Name  ... News_Length
## 0   001.txt  ...        2569
## 1   002.txt  ...        2257
## 2   003.txt  ...        1557
## 3   004.txt  ...        2421
## 4   005.txt  ...        1575
## 
## [5 rows x 5 columns]

# Some basic cleaning: replace \r and \n with spaces
df['Content_Parsed_1'] = df['Content'].str.replace("\r", " ")
df['Content_Parsed_1'] = df['Content_Parsed_1'].str.replace("\n", " ")



def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data


def remove_punctuation(data):
    # Replace every non-word character (anything other than letters, digits
    # and underscore) with a space; one pass over the list is sufficient
    for index, entry in enumerate(data):
        data[index] = re.sub(r'[^\w]', ' ', entry)
    return data

def remove_stopwords_from_tokens(data):
       stop_words = set(stopwords.words("english"))
       new_list = []
       for index, entry in enumerate(data):
           no_stopwords = ""
           entry = entry.split()
           for word in entry:
               if word not in stop_words:
                    no_stopwords = no_stopwords + " " + word 
           new_list.append(no_stopwords)
       return new_list

def lemmatiser (pdf, pcol):
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized_text_list = []
    
   
    
    for row in range(len(pdf)):
        
        
        # Create an empty list containing lemmatized words
        lemmatized_list = []
        
        # Save the text and its words into an object
        text = pdf.loc[row, pcol]
        #print(text)
       
        text_words = text.split(" ")
    
        # Iterate through every word to lemmatize
        for word in text_words:
            lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
            
        # Join the list
        lemmatized_text = " ".join(lemmatized_list)
        
        # Append to the list containing the texts
        lemmatized_text_list.append(lemmatized_text)
    return lemmatized_text_list


# Lowercasing the text
df['Content_Parsed_2'] = df['Content_Parsed_1'].str.lower()
# remove punctuation
df['Content_Parsed_3'] = pd.Series(remove_punctuation (list(df['Content_Parsed_2'])))
#remove possessive
df['Content_Parsed_4'] = df['Content_Parsed_3'].str.replace("'s", "")

#lemmatise
df['Content_Parsed_5'] = lemmatiser (df, 'Content_Parsed_4')

df['Content_Parsed_6'] = df['Content_Parsed_5']

#remove stopwords
df['Content_Parsed_6'] = pd.Series(remove_stopwords_from_tokens(list(df['Content_Parsed_6'])))

list_columns = ["File_Name", "Category", "Complete_Filename", "Content", "Content_Parsed_6"]
df = df[list_columns]

df = df.rename(columns={'Content_Parsed_6': 'Content_Parsed'})

print(df.loc[3,'Content_Parsed'])
##  high fuel price hit ba profit british airways blame high fuel price 40 drop profit report result three months 31 december 2004 airline make pre tax profit â 75m 141m compare â 125m year earlier rod eddington ba chief executive say result respectable third quarter fuel cost rise â 106m 47 3 ba profit still better market expectation â 59m expect rise full year revenues help offset increase price aviation fuel ba last year introduce fuel surcharge passengers october increase â 6 â 10 one way long haul flight short haul surcharge raise â 2 50 â 4 leg yet aviation analyst mike powell dresdner kleinwort wasserstein say ba estimate annual surcharge revenues â 160m still way short additional fuel cost predict extra â 250m turnover quarter 4 3 â 1 97bn benefit rise cargo revenue look ahead full year result march 2005 ba warn yield average revenues per passenger expect decline continue lower price face competition low cost carriers however say sales would better previously forecast year march 2005 total revenue outlook slightly better previous guidance 3 3 5 improvement anticipate ba chairman martin broughton say ba previously forecast 2 3 rise full year revenue also report friday passenger number rise 8 1 january aviation analyst nick van den brul bnp paribas describe ba latest quarterly result pretty modest quite good revenue side show impact fuel surcharge positive cargo development however operate margins cost impact fuel strong say since 11 september 2001 attack unite state ba cut 13 000 job part major cost cut drive focus remain reduce controllable cost debt whilst continue invest products mr eddington say example take delivery six airbus a321 aircraft next month start improvements club world flat bed ba share close four pence 274 5 pence
df.head(1)
##   File_Name  ...                                     Content_Parsed
## 0   001.txt  ...   ad sales boost time warner profit quarterly p...
## 
## [1 rows x 5 columns]


 

ADD LABELS


 

category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

# Category mapping
df['Category_Code'] = df['Category']
df = df.replace({'Category_Code':category_codes})


# Ensure no other category remains in the dataframe
df = df[df['Category_Code'].isin(list(category_codes.values()))]


df.tail()
##      File_Name  ... Category_Code
## 2220   397.txt  ...             4
## 2221   398.txt  ...             4
## 2222   399.txt  ...             4
## 2223   400.txt  ...             4
## 2224   401.txt  ...             4
## 
## [5 rows x 6 columns]


 

TRAIN - TEST SPLIT


 

To prove the quality of the model a subset of the data is set apart for testing. The training data is used to tune hyperparameters and then test performance on the unseen data of the test set.

Test set size of 15% of the full dataset is used.


 

X_train, X_test, y_train, y_test = train_test_split(df['Content_Parsed'], 
                                                    df['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)


 

Parameters for tfidfvectorizer


# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300


 

tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
                        
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
labels_train = np.array(labels_train, dtype=int)

#training data
print(features_train.shape)
## (1891, 300)


 

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
labels_test = np.array(labels_test, dtype=int)


#test data
print(features_test.shape)
## (334, 300)
for category, category_id in sorted(category_codes.items()):
    # Chi-squared statistic of every feature against "belongs to this category"
    features_chi2 = chi2(features_train, labels_train == category_id)
    # Sort features from least to most associated with the category
    indices = np.argsort(features_chi2[0])
    # On scikit-learn >= 1.0, tfidf.get_feature_names_out() replaces get_feature_names()
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")
## # 'business' category:
##   . Most correlated unigrams:
## . firm
## . market
## . economy
## . growth
## . bank
##   . Most correlated bigrams:
## . last year
## . year old
## 
## # 'entertainment' category:
##   . Most correlated unigrams:
## . tv
## . music
## . star
## . award
## . film
##   . Most correlated bigrams:
## . mr blair
## . prime minister
## 
## # 'politics' category:
##   . Most correlated unigrams:
## . minister
## . blair
## . election
## . party
## . labour
##   . Most correlated bigrams:
## . prime minister
## . mr blair
## 
## # 'sport' category:
##   . Most correlated unigrams:
## . win
## . side
## . game
## . team
## . match
##   . Most correlated bigrams:
## . say mr
## . year old
## 
## # 'tech' category:
##   . Most correlated unigrams:
## . digital
## . technology
## . computer
## . software
## . users
##   . Most correlated bigrams:
## . year old
## . say mr


 

3.7 Model Training


 

Once feature vectors are built, machine learning classification models are trained to find the one that performs best on our data. We will try the following 2 models:


 

  • Random Forest
  • Support Vector Machine

The methodology used to train each model is as follows:

  • Decide on which hyperparameters to tune.
  • Define the metric to use when measuring the performance of a model.
  • Once the best combination of hyperparameters is found, it is used to obtain the accuracy on the training data and the test data, the classification report and the confusion matrix.
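
The methodology above can be illustrated without any library by a tiny grid search with cross-validation. The sketch below is an assumption-laden toy: it tunes the hyperparameter k of a one-dimensional nearest-neighbour classifier on invented data, using accuracy as the metric. In the course code the same idea would be carried out with scikit-learn's GridSearchCV or RandomizedSearchCV, which are imported earlier.

```python
from collections import Counter

# Invented 1-D toy data: (feature value, class label)
data = [(0.1, 0), (0.3, 0), (0.4, 0), (0.5, 0), (0.9, 0), (1.1, 1),
        (1.4, 1), (1.6, 1), (1.8, 1), (2.0, 1), (0.8, 1), (2.2, 1)]

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, n_folds=4):
    """Mean validation accuracy of k-NN over n_folds cross-validation folds."""
    fold_scores = []
    for fold in range(n_folds):
        valid = data[fold::n_folds]  # every n_folds-th point forms the validation fold
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in valid)
        fold_scores.append(hits / len(valid))
    return sum(fold_scores) / n_folds

grid = [1, 3, 5]  # candidate values of the hyperparameter k
results = {k: cv_accuracy(data, k) for k in grid}
best_k = max(results, key=results.get)
print(results, best_k)
```

Only the training data should be used in this search; the held-out test set is touched once, at the end, with the chosen hyperparameters.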


 

svc_0= svm.SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=True, random_state=8, shrinking=True, tol=0.001,
  verbose=False)


print('Parameters currently in use:\n')
## Parameters currently in use:
print(svc_0.get_params())
## {'C': 0.1, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'auto', 'kernel': 'linear', 'max_iter': -1, 'probability': True, 'random_state': 8, 'shrinking': True, 'tol': 0.001, 'verbose': False}


 

svc_0.fit(features_train, labels_train)
## SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
##     decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
##     max_iter=-1, probability=True, random_state=8, shrinking=True, tol=0.001,
##     verbose=False)
svc_pred = svc_0.predict(features_test)


 

print("The training accuracy is: ")
## The training accuracy is:
print(accuracy_score(labels_train, svc_0.predict(features_train)))
## 0.9603384452670545


 

print("Classification report")
## Classification report
print(classification_report(labels_test,svc_pred))
##               precision    recall  f1-score   support
## 
##            0       0.87      0.98      0.92        81
##            1       0.96      0.96      0.96        49
##            2       0.97      0.88      0.92        72
##            3       0.99      0.99      0.99        72
##            4       0.93      0.88      0.91        60
## 
##     accuracy                           0.94       334
##    macro avg       0.94      0.94      0.94       334
## weighted avg       0.94      0.94      0.94       334
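
To show where the numbers in a classification report come from, the sketch below computes accuracy, precision, recall and F1 by hand for one class. The label lists are invented for illustration and are unrelated to the BBC results above; the function name `per_class_metrics` is likewise an assumption.

```python
# Invented true and predicted labels for three classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision_0, recall_0, f1_0 = per_class_metrics(y_true, y_pred, 0)
print(accuracy, precision_0, recall_0, f1_0)
```

Precision asks how many of the predictions for a class were right, while recall asks how many of the true members of the class were found; F1 is their harmonic mean.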


 

Chapter 4: Sentiment Analysis/Classification/Opinion Mining


  Intended Learning Outcomes
  1. Describe sentiment analysis.
  2. Describe how sentiment analysis could be used to determine sentiment around a new product.
  3. Give examples where sentiment in language may prove to be challenging to detect.
  4. List advantages of sentiment analysis systems.
  5. Describe how sentiment analysis can be undertaken.
  6. Identify the main steps involved in building a machine learning based sentiment classifier.
  7. Take a data set and build a supervised machine learning sentiment analysis classifier in Python.
 

4.1 What is Sentiment Analysis

    Sentiment analysis is a text classification task. Given a phrase or a list of phrases, the classifier should indicate whether each phrase is positive, negative or neutral.
 

Sentiment Analysis systems typically identify the following attributes of an expression:
    Polarity: whether the speaker expresses a positive or negative opinion.
Subject: the thing that is being talked about.
Opinion holder: the person, or entity that expresses the opinion.
 


 

In essence: “It is the process of determining the emotional tone behind a series of words, used to gain an understanding of the attitudes, opinions and emotions expressed within an online mention” (Bannister, 2018).

Used in:


  • Social Media Monitoring
  • Electioneering
  • Market Research/Customer Service

Sentiment analysis systems allow companies to make sense of the sea of unstructured text by automating business processes, getting actionable insights, and saving hours of manual data processing.

  Sentiment analysis can be applied at different levels of scope:
  Document level sentiment analysis obtains the sentiment of a complete document or paragraph.
Sentence level sentiment analysis obtains the sentiment of a single sentence.
Sub-sentence level sentiment analysis obtains the sentiment of sub-expressions within a sentence.
 

There are many types and flavors of sentiment analysis, ranging from systems that focus on polarity (positive, negative, neutral) to systems that detect feelings and emotions (angry, happy, sad, etc.) or identify intentions (e.g. interested v. not interested).
 

For example Aspect-based sentiment analysis indicates sentiment on different features related to a product.
 


Another example is intent analysis, which detects what people want to do with a text rather than what they say with that text. Look at the following examples:

“Your customer support is a disaster. I’ve been on hold for 20 minutes”.

“I would like to know how to replace the cartridge”.

“Can you help me fill out this form?”
  A human being has no problem detecting the complaint in the first text, the question in the second text, and the request in the third text. However, machines can struggle to identify them. Sometimes the intended action can be inferred from the text, but sometimes inferring it requires some contextual knowledge.

4.2 Challenges in Sentiment Analysis

Human language is complex. Teaching a machine to analyse the various grammatical nuances, cultural variations, slang and misspellings that occur in online mentions is a difficult process. Teaching a machine to understand how context can affect tone is even more difficult.
 

Contextual understanding
  Contextual understanding is pivotal for accurate sentiment detection.
  For example: “I am craving McDonald’s so bad”.
  Most systems will misinterpret this statement as negative because of the phrase “so bad”.
  Sentiment Ambiguity
  “Can you recommend any good holiday destinations?”
  This statement doesn’t express any sentiment, although it uses the positive sentiment word “good”.
  Sarcasm
  “This phone has an awesome battery back-up of 2 hours.”
  Obviously, this statement is negative, even though it has the positive word “awesome”.
  Comparatives
  “iPhone is much better than Samsung.”
  Most sentiment analyser tools cannot “pick sides” when they find comparative statements like this one; they can only pick the sentiment based on keywords.
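
The failure modes above are easy to reproduce with a naive keyword counter. The sketch below is a deliberately simplistic illustration; the two word lists and the function `keyword_sentiment` are invented for this example and are not a real sentiment lexicon.

```python
# Hypothetical, tiny sentiment lexicons (illustrative only)
POSITIVE = {"good", "awesome", "better", "great"}
NEGATIVE = {"bad", "disaster", "terrible"}

def keyword_sentiment(text):
    """Label a text by counting positive and negative keywords, ignoring context."""
    words = [word.strip(".,!?\"'").lower() for word in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

craving = keyword_sentiment("I am craving McDonald's so bad")
question = keyword_sentiment("Can you recommend any good holiday destinations?")
sarcasm = keyword_sentiment("This phone has an awesome battery back-up of 2 hours.")
print(craving, question, sarcasm)
```

All three labels are wrong for the reasons given above: the craving is read as negative, the neutral question as positive, and the sarcastic complaint as positive.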
 

4.3 Advantages of sentiment analysis

 

Scalability:
  There’s just too much data to process manually. Sentiment analysis allows processing of data at scale in an efficient and cost-effective way.
 

Real-time analysis:
  Sentiment analysis can identify critical information that allows situational awareness during specific scenarios in real-time.
 

Consistent criteria:
  By using a centralized sentiment analysis system, companies can apply the same criteria to all of their data. This helps to reduce errors and improve data consistency.
 

4.4 Practical Sentiment Analysis

 

There are many methods and algorithms to implement sentiment analysis systems, which can be classified as:
  Rule-based systems that perform sentiment analysis based on a set of manually crafted rules.
  Automatic systems that rely on machine learning techniques to learn from data.
  Hybrid systems that combine both rule based and automatic approaches.
 

Below are the steps that are typically undertaken in building an automatic sentiment analysis system.
 


 

In a more practical sense, the objective here is to take text and produce a label (or labels) that summarizes the sentiment of this text, e.g. positive, neutral, or negative.
 

To solve this problem, a typical machine learning pipeline is followed.
 

1. Import the required libraries and the dataset.
2. Exploratory data analysis.
3. Text preprocessing.
4. Apply machine learning algorithms to train and test our sentiment analysis models.
 

Import the Dataset
 

import pandas as pd

airline_tweets = pd.read_csv("C:/data/tweets/Tweets.csv")
airline_tweets.head()
##              tweet_id  ...               user_timezone
## 0  570306133677760513  ...  Eastern Time (US & Canada)
## 1  570301130888122368  ...  Pacific Time (US & Canada)
## 2  570301083672813571  ...  Central Time (US & Canada)
## 3  570301031407624196  ...  Pacific Time (US & Canada)
## 4  570300817074462722  ...  Pacific Time (US & Canada)
## 
## [5 rows x 15 columns]


 

Data Analysis
 


import operator
import pandas as pd
import re
import sklearn
from sklearn.decomposition import PCA
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import ngrams
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk.collocations import *
from nltk.corpus import webtext
import numpy as np
import random
import pickle
from sklearn.metrics.pairwise import cosine_similarity  



 

import matplotlib.pyplot as plt

# Inspect the default figure size (width and height, in inches)
plot_size = plt.rcParams["figure.figsize"]
print(plot_size[0])
print(plot_size[1])

# Enlarge the default figure size to 8 x 6 inches
plot_size[0] = 8
plot_size[1] = 6
plt.rcParams["figure.figsize"] = plot_size

airline_tweets.airline.value_counts().plot(kind='pie', autopct='%1.0f%%')


 


 

airline_tweets.airline_sentiment.value_counts().plot(kind='pie', autopct='%1.0f%%', colors=["red", "yellow", "green"])


 


 

import seaborn as sns

airline_sentiment = airline_tweets.groupby(['airline', 'airline_sentiment']).airline_sentiment.count().unstack()
airline_sentiment.plot(kind='bar')


 


 

Data Cleaning
 

Tweets contain many slang words and punctuation marks. The tweets will have to be cleaned before they can be used for training the machine learning model.

Before cleaning the tweets, the dataset should be split into feature and label sets.
 

The feature set will consist of the tweets only. The label set will consist of the sentiment of each tweet, which is what we have to predict. The tweet text is in the 11th column (index 10) and the sentiment of the tweet is in the second column (index 1). To create the feature and label sets, we can use the iloc indexer of the pandas data frame.
 

features = airline_tweets.iloc[:, 10].values
labels = airline_tweets.iloc[:, 1].values

print(features[1:5])
## ["@VirginAmerica plus you've added commercials to the experience... tacky."
##  "@VirginAmerica I didn't today... Must mean I need to take another trip!"
##  '@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse'
##  "@VirginAmerica and it's a really big bad thing about it"]
print(labels[1:5])
## ['positive' 'neutral' 'negative' 'negative']
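
Selecting columns by position breaks silently if the column order ever changes, so selecting by name can be a safer alternative. A minimal sketch follows, using a small invented frame in place of the real file. The column name `text` is an assumption about the dataset (only `airline` and `airline_sentiment` appear in the code in this chapter), and the names `toy_tweets`, `toy_features` and `toy_labels` are introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the airline tweets data (values invented for illustration)
toy_tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "airline_sentiment": ["positive", "neutral", "negative"],
    "text": ["great flight", "what time is boarding", "lost my bag again"],
})

# Select by column name instead of position; "text" is an assumed column name
toy_features = toy_tweets["text"].values
toy_labels = toy_tweets["airline_sentiment"].values

print(toy_features)
print(toy_labels)
```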


 

Now the tweets should be cleaned. Functions developed in the Introduction to NLP course could be used; however, knowing how to use regular expressions is a key skill in NLP, so they will be used here to clean the text. Further information on how to use regular expressions in Python can be found here: https://stackabuse.com/using-regex-for-text-manipulation-in-python/


 

processed_features = []

for tweet in features:
    # Replace every non-word character (punctuation, symbols) with a space
    processed_feature = re.sub(r'\W', ' ', str(tweet))

    # Remove single characters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove a single character at the start of the text
    # (the caret must not be escaped, otherwise it matches a literal '^')
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Substitute multiple spaces with a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)

    # Remove a prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)

# Inspect one cleaned tweet (index 12 of the list, not of the last string)
print(processed_features[12])
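
To see what each substitution does, the same steps can be applied to a single tweet. This is a sketch using one of the tweets printed earlier. The helper name `clean_tweet` is introduced here for illustration, and the start-of-string pattern is written with an unescaped caret so that it anchors to the beginning of the text.

```python
import re

def clean_tweet(tweet):
    """Apply the regular-expression cleaning steps to one tweet."""
    text = re.sub(r"\W", " ", str(tweet))        # non-word characters -> space
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # isolated single letters
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single letter at the start
    text = re.sub(r"\s+", " ", text)             # collapse repeated whitespace
    return text.lower()

raw = "@VirginAmerica plus you've added commercials to the experience... tacky."
print(repr(clean_tweet(raw)))
```

Note that the apostrophe in `you've` is replaced by a space, which leaves the fragment `ve` behind, and that a leading and a trailing space remain because the steps never strip the ends of the string.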


Representing Text in Numeric Form
 

To make statistical algorithms work with text, we first have to convert the text to numbers (see Intro to NLP, section on vectorisation).
The tweets will be scored using the TF-IDF scheme.
 

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer (max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(processed_features).toarray()


 

In the code above, max_features is set to 2500, which means that only the 2500 most frequently occurring words are used to create the feature vector. Words that occur less frequently are not very useful for classification.
 

Similarly, max_df is set to 0.8, so only words that occur in at most 80% of the documents are used. Words that occur in nearly all documents are too common to be useful for classification. Finally, min_df is set to 7, so only words that occur in at least 7 documents are included.
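
How these three parameters prune the vocabulary can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the function name `prune_vocabulary` and the toy documents are assumptions, and simple whitespace splitting stands in for the vectorizer's own tokeniser.

```python
from collections import Counter

def prune_vocabulary(documents, min_df=1, max_df=1.0, max_features=None):
    """Return the terms kept after document-frequency and size limits."""
    tokenised = [doc.lower().split() for doc in documents]
    # Number of documents containing each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Total number of occurrences of each term in the corpus
    term_freq = Counter(term for tokens in tokenised for term in tokens)

    n_docs = len(documents)
    # min_df is an absolute document count, max_df a proportion of documents
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df]
    # Keep the most frequent surviving terms (ties broken alphabetically)
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:max_features] if max_features else kept

docs = [
    "the flight was delayed",
    "the flight was cancelled",
    "the crew was great",
    "the crew lost my bag",
]
print(prune_vocabulary(docs, min_df=2, max_df=0.8, max_features=2500))
```

With these toy documents, `the` is dropped by max_df because it occurs in every document, and words that appear in only one document are dropped by min_df, leaving `was`, `crew` and `flight`.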


Dividing Data into Training and Test Sets

Before we train our algorithms, we need to divide our data into training and testing sets. The training set will be used to train the algorithm while the test set will be used to evaluate the performance of the machine learning model.
 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)


The train_test_split function from the sklearn.model_selection module is used to divide the data into training and test sets. It takes the feature set as the first argument, the label set as the second argument, and a value for the test_size parameter. Here test_size is 0.2, which means that the data set will be split into two parts: 80% of the data will be used for training and 20% for testing. Setting random_state=0 makes the split reproducible.
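
What the split does can be sketched without scikit-learn. This is a simplified illustration: `split_indices` is a hypothetical helper, and its shuffle will not select the same rows as `train_test_split`. The figure of 14,640 tweets is inferred from the test-set support of 2,928 reported in the evaluation output later in this section (2,928 / 0.2).

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = math.ceil(test_size * n_samples)  # size of the held-out part
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(14640, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

Every row index ends up in exactly one of the two parts, so no tweet is used for both training and testing.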
 

Training the Model
 

The Random Forest algorithm will be used due to its ability to act upon non-normalized data.
 

The sklearn.ensemble module contains the RandomForestClassifier class that can be used to train the machine learning model using the random forest algorithm.
First, a RandomForestClassifier object is created. Its fit method is then called with the training features and labels as parameters.
 

from sklearn.ensemble import RandomForestClassifier

text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
##                        max_depth=None, max_features='auto', max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, n_estimators=200,
##                        n_jobs=None, oob_score=False, random_state=0, verbose=0,
##                        warm_start=False)


Making Predictions and Evaluating the Model

Once the model has been trained, the last step is to make predictions with it.
To do so, the predict method is called on the RandomForestClassifier object that was used for training.
Look at the following script:
 

predictions = text_classifier.predict(X_test)
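
To score a tweet that was not in the dataset, the text has to pass through the same fitted vectorizer before it reaches the classifier. One way to keep the two steps together is scikit-learn's `make_pipeline`. The sketch below uses a handful of invented tweets rather than the airline data, so its predictions are not meaningful in themselves; it only illustrates the mechanics.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented toy data (for illustration only)
toy_tweets = [
    "great flight and friendly crew",
    "lovely crew great service",
    "flight delayed and bag lost",
    "terrible service lost my bag",
    "what time does boarding start",
    "which gate is the flight",
]
toy_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(toy_tweets, toy_labels)

# New text is vectorised with the vocabulary learned during fit
new_tweets = ["my bag was lost", "great crew"]
toy_predictions = model.predict(new_tweets)
print(toy_predictions)
```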


Finally, to evaluate the performance of the machine learning model, classification metrics such as the confusion matrix, F1 measure and accuracy are used, as shown below.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,predictions))
## [[1723  108   39]
##  [ 326  248   40]
##  [ 132   58  254]]
print(classification_report(y_test,predictions))
##               precision    recall  f1-score   support
## 
##     negative       0.79      0.92      0.85      1870
##      neutral       0.60      0.40      0.48       614
##     positive       0.76      0.57      0.65       444
## 
##     accuracy                           0.76      2928
##    macro avg       0.72      0.63      0.66      2928
## weighted avg       0.75      0.76      0.74      2928
print(accuracy_score(y_test, predictions))
## 0.7599043715846995
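
The figures in the report can be recomputed by hand from the confusion matrix, which helps when interpreting them. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the alphabetical label order shown in the report. The variable names are introduced here for illustration.

```python
class_names = ["negative", "neutral", "positive"]
# Confusion matrix copied from the output above (rows = true, columns = predicted)
matrix = [
    [1723, 108, 39],
    [326, 248, 40],
    [132, 58, 254],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(class_names)))
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.4f}")

metrics = {}
for i, name in enumerate(class_names):
    predicted_total = sum(row[i] for row in matrix)  # column sum
    true_total = sum(matrix[i])                      # row sum (the "support")
    precision = matrix[i][i] / predicted_total
    recall = matrix[i][i] / true_total
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = (round(precision, 2), round(recall, 2), round(f1, 2))
    print(name, metrics[name])
```

Reading the matrix this way shows where the model struggles: of the 614 neutral tweets in the test set, 326 were predicted as negative, which is why recall for the neutral class is only 0.40.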


 

Several Python libraries were used together to perform sentiment analysis. The analysis was carried out on public tweets about six US airlines, and the model achieved an accuracy of around 76%.

Exercises:
 

References:
 

https://medium.com/seek-blog/your-guide-to-sentiment-analysis-344d43d225a7
https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/
https://monkeylearn.com/sentiment-analysis/